A Production Oriented Approach for Vandalism Detection in Wikidata - The Buffaloberry Vandalism Detector at WSDM Cup 2017

نویسندگان

  • Rafael Crescenzi
  • Marcelo Fernández
  • Federico A. Garcia Calabria
  • Pablo Albani
  • Diego Tauziet
  • Adriana Baravalle
  • Andrés Sebastián D'Ambrosio
چکیده

Wikidata is a free and open knowledge base from the Wikimedia Foundation, that not only acts as a central storage of structured data for other projects of the organization, but also for a growing array of information systems, including search engines. Like Wikipedia, Wikidata’s content can be created and edited by anyone; which is the main source of its strength, but also allows for malicious users to vandalize it, risking the spreading of misinformation through all the systems that rely on it as a source of structured facts. Our task at the WSDM Cup 2017 was to come up with a fast and reliable prediction system that narrows down suspicious edits for human revision [8]. Elaborating on previous works by Heindorf et al. we were able to outperform all other contestants, while incorporating new interesting features, unifying the programming language used to only Python and refactoring the feature extractor into a simpler and more compact code base.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Ensemble Models for Detecting Wikidata Vandalism with Stacking - Team Honeyberry Vandalism Detector at WSDM Cup 2017

The WSDM Cup 2017 is a binary classification task for classifying Wikidata revisions into vandalism and non-vandalism. This paper describes our method using some machine learning techniques such as under-sampling, feature selection, stacking and ensembles of models. We confirm the validity of each technique by calculating AUC-ROC of models using such techniques and not using them. Additionally,...

متن کامل

Wikidata Vandalism Detection - The Loganberry Vandalism Detector at WSDM Cup 2017

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. As it can be edited by anyone, entries frequently get vandalized, leading to the possibility that it might spread of falsified information if such posts are not detected. The WSDM 2017 Wiki Vandalism Detection Challenge requires us to solve this problem by computing a vandalism score denoting the likelihood that a revi...

متن کامل

Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017

We report on the Wikidata vandalism detection task at the WSDM Cup 2017. The task received five submissions for which this paper describes their evaluation and a comparison to state of the art baselines. Unlike previous work, we recast Wikidata vandalism detection as an online learning problem, requiring participant software to predict vandalism in near real-time. The best-performing approach a...

متن کامل

Large-Scale Vandalism Detection with Linear Classifiers - The Conkerberry Vandalism Detector at WSDM Cup 2017

Nowadays many artificial intelligence systems rely on knowledge bases for enriching the information they process. Such Knowledge Bases are usually difficult to obtain and therefore they are crowdsourced: they are available for everyone on the internet to suggest edits and add new information. Unfortunately, they are sometimes targeted by vandals who put inaccurate or offensive information there...

متن کامل

Building Automated Vandalism Detection Tools for Wikidata

Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open collaboration model is powerful in that it reduces barriers to participation and allows a large number of people to contribute. However, it exposes the knowledge base to the risk of vandalism and low-quality contributions. In this work, we build on past work detecting vandalism in Wikipedia to detect vandalism in Wiki...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1712.06919  شماره 

صفحات  -

تاریخ انتشار 2017